Appendix B — Assignment 2

Instructions

  1. You may discuss the questions and potential directions for solving them with a friend. However, you must write your own solutions and code separately, not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Monday, 18th April 2025 at 11:59 pm.

  5. Five points are allotted for properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering with Quarto in the comments section of Canvas, and submit the .ipynb file.
    • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  6. The maximum possible score in the assignment is 103 + 5 = 108 out of 100.

B.1 Optimizing KNN for Classification (71 points)

In this question, you will use classification_data.csv. Each row is a loan, and each column represents some financial information as follows:

  • hi_int_prncp_pd: Indicates if a high percentage of the repayments went to interest rather than principal. This is the classification response.

  • out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors

  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

  • int_rate: Interest Rate on the loan

  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.

As indicated above, hi_int_prncp_pd is the response and all the remaining columns are predictors. You will tune and train a K-Nearest Neighbors (KNN) classifier throughout this question.

B.1.1 a) Load the Dataset (1 point)

Read the dataset into your notebook.

B.1.2 b) Define Predictor and Response Variables (1 point)

Create the predictor (features) and response (target) variables from the dataset.
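A minimal sketch of parts (a) and (b), using a tiny synthetic stand-in for classification_data.csv (in the assignment you would call pd.read_csv("classification_data.csv") instead of building the DataFrame by hand):

```python
import pandas as pd

# Toy stand-in for classification_data.csv with the columns described above
df = pd.DataFrame({
    "hi_int_prncp_pd": [1, 0, 1, 0],
    "out_prncp_inv": [1200.0, 800.0, 950.0, 400.0],
    "loan_amnt": [10000, 5000, 8000, 3000],
    "int_rate": [13.5, 7.2, 11.1, 6.0],
    "term": [60, 36, 60, 36],
})

y = df["hi_int_prncp_pd"]               # response (target)
X = df.drop(columns="hi_int_prncp_pd")  # predictors (features)
```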

B.1.3 c) Split the Data into Training and Test Sets (1 point)

Create the training and test datasets using the following specifications:

  • Use a 75%-25% split.
  • Ensure that the class ratio is preserved in both training and test sets (i.e., stratify the split).
  • Set random_state=45 for reproducibility.
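These specifications can be sketched as follows (synthetic stand-in data generated with make_classification in place of the loan data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 4 predictors, binary response
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 75%-25% stratified split with the required seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=45
)
```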

B.1.4 d) Check Class Ratios (2 points)

Print the class distribution (ratios) for:

  • The entire dataset
  • The training set
  • The test set

This is to verify that the class ratio is preserved after splitting.
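One way to print the three distributions, again on synthetic stand-in data, is value_counts(normalize=True):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=45
)

# Class ratios for the full data, the training set, and the test set
for name, labels in [("full", y), ("train", y_tr), ("test", y_te)]:
    print(name, pd.Series(labels).value_counts(normalize=True).round(3).to_dict())
```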

B.1.5 e) Scale the Dataset (2 points)

Use StandardScaler to scale the dataset in order to prepare it for KNN modeling.

Scaling ensures that all features contribute equally to the distance calculations used by the KNN algorithm.
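A common pattern (shown on random stand-in arrays) is to fit the scaler on the training data only and then apply the same transform to the test data, so no test-set information leaks into the scaling:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in features on wildly different scales
X_train = rng.normal(size=(150, 4)) * np.array([1, 10, 100, 1000])
X_test = rng.normal(size=(50, 4)) * np.array([1, 10, 100, 1000])

# Fit on the training set only; reuse the same means/stds for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```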

B.1.6 f) Set Up Cross-Validation (2 points)

Before creating and tuning your model, you need to define cross-validation settings to ensure consistent and accurate evaluation across folds.

Please follow these specifications:

  • Use 5 stratified folds to preserve class distributions in each split.
  • Shuffle the data before splitting to introduce randomness.
  • Set random_state=14 for reproducibility.

Note: You must use these exact cross-validation settings throughout the rest of this question to maintain consistency.
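The required settings map directly onto StratifiedKFold; the toy labels below are only a sanity check that stratification preserves the class ratio in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 5 stratified, shuffled folds with the required seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

# Sanity check on toy 70/30 labels: every fold keeps the 0.30 class ratio
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((100, 1))
for _, test_idx in cv.split(X, y):
    print(round(y[test_idx].mean(), 2))  # 0.3 in every fold
```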

B.1.7 g) Tune K for KNN Using Cross-Validation (12 points)

Tune a KNN Classifier using cross-validation with the following specifications:

  • Use every odd K value from 1 to 50 (inclusive).
  • Keep all other model settings at their defaults.
  • Use the cross-validation settings defined in part (f).
  • Evaluate performance using the F1 score as the metric.

(4 points)

Then, complete the following tasks:

  • Create a plot of K values vs. cross-validation F1 scores to visualize how K balances overfitting and underfitting. (2 points)
  • Print the best average cross-validation F1 score. (1 point)
  • Report the K value corresponding to the best F1 score. (1 point)
  • Determine whether this is the only K value that results in the best F1 score. Use code to justify your answer. (2 points)
  • Reflect on whether accuracy is a good metric for tuning the model in this case. Explain your reasoning. (2 points)

💡 Hint:

In addition to reporting the best K and best F1 score, you may also want to examine the full cross-validation results to check if other K values achieved the same F1 score.
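On synthetic stand-in data, the K sweep and the tie check could look like this sketch (cross_val_score with scoring="f1" is one straightforward way; a grid search would also work):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the (scaled) training data
X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

ks = np.arange(1, 50, 2)  # odd K values: 1, 3, ..., 49
f1_scores = np.array([
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                    cv=cv, scoring="f1").mean()
    for k in ks
])

best = f1_scores.argmax()
print(f"best K = {ks[best]}, best CV F1 = {f1_scores[best]:.3f}")
print("ties:", ks[f1_scores == f1_scores[best]])  # every K hitting the best score
```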

B.1.8 h) Optimize the Classification Threshold (4 points)

Using the optimal K value you identified in part (g), optimize the classification threshold to maximize the cross-validation F1 score.

B.1.8.1 Specifications:

  • Search across all possible threshold values using a step size of 0.05.
  • Use the cross-validation settings defined in part (f).
  • Evaluate performance using the F1 score, consistent with part (g).

B.1.8.2 Tasks:

  • Visualize the F1 score vs. different threshold values. (2 points)
  • Identify and report the best threshold that yields the highest F1 score. (1 point)
  • Output the best cross-validation F1 score. (1 point)
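One common implementation (an approach choice, not the only valid one) pools out-of-fold probabilities from cross_val_predict and sweeps thresholds over them; the K value below is a hypothetical placeholder for your part (g) result, and the data is a synthetic stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
knn = KNeighborsClassifier(n_neighbors=25)  # placeholder: use your K from part (g)

# Out-of-fold predicted probabilities for class 1
proba = cross_val_predict(knn, X, y, cv=cv, method="predict_proba")[:, 1]

thresholds = np.arange(0, 1.01, 0.05)  # step size 0.05
f1s = np.array([f1_score(y, (proba >= t).astype(int)) for t in thresholds])
best = f1s.argmax()
print(f"best threshold = {thresholds[best]:.2f}, CV F1 = {f1s[best]:.3f}")
```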

B.1.9 i) Evaluate the Tuning Method (2 points)

Is the method we used in parts (g) and (h) guaranteed to find the best combination of K and threshold, i.e., to tune the classifier to its optimal values?
(1 point)

Justify your answer.
(1 point)

B.1.10 j) Evaluate Tuned Classifier on Test Set (3 points)

Using the tuned KNN classifier and the optimal threshold you identified, evaluate the model on the test set. Report the following metrics:

  • F1 Score
  • Accuracy
  • Precision
  • Recall
  • AUC
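A sketch of the test-set evaluation on synthetic stand-in data; the K value and threshold are hypothetical placeholders for your tuned values. Note that AUC is computed from the probabilities, not from the thresholded labels:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=45
)

scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=25)  # placeholder K
knn.fit(scaler.transform(X_tr), y_tr)
threshold = 0.40  # placeholder tuned threshold

proba = knn.predict_proba(scaler.transform(X_te))[:, 1]
pred = (proba >= threshold).astype(int)

print("F1       :", f1_score(y_te, pred))
print("Accuracy :", accuracy_score(y_te, pred))
print("Precision:", precision_score(y_te, pred))
print("Recall   :", recall_score(y_te, pred))
print("AUC      :", roc_auc_score(y_te, proba))  # uses probabilities
```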

B.1.11 k) Jointly Tune K and Threshold (6 points)

Now, tune K and the classification threshold simultaneously, rather than sequentially.

  • Use the same settings from the previous parts (i.e., odd K values from 1 to 50, threshold step size of 0.05, F1 score as the metric, and the same cross-validation strategy).
  • Identify the best F1 score, along with the K value and threshold that produce it.
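A brute-force sketch of the joint search (again on synthetic stand-in data). The out-of-fold probabilities only need to be computed once per K, not once per (K, threshold) pair:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
ks = np.arange(1, 50, 2)
thresholds = np.arange(0, 1.01, 0.05)

results = {}
for k in ks:
    # One set of out-of-fold probabilities per K
    proba = cross_val_predict(KNeighborsClassifier(n_neighbors=k),
                              X, y, cv=cv, method="predict_proba")[:, 1]
    for t in thresholds:
        results[(k, t)] = f1_score(y, (proba >= t).astype(int))

best_k, best_t = max(results, key=results.get)
print(f"best K = {best_k}, best threshold = {best_t:.2f}, "
      f"CV F1 = {results[(best_k, best_t)]:.3f}")
```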

B.1.12 l) Visualize Cross-Validation Results with a Heatmap (3 points)

Create a heatmap to visualize the cross-validation results in two dimensions.

  • The x-axis should represent the K values.
  • The y-axis should represent the threshold values.
  • The color should represent the F1 score.

Note: This question only requires one line of code. You’ll need to recall a data visualization function and a data reshaping method from 303-1.

B.1.13 m) Compare Joint vs. Sequential Tuning Results (4 points)

  • How does the best cross-validation F1 score from part (k) compare to the scores from parts (g) and (h)? (1 point)
  • Did the optimal K value and threshold change when tuning them jointly? (1 point)
  • Explain why or why not. Consider how tuning the two parameters together might impact the result. (2 points)

B.1.14 n) Evaluate Final Tuned Model on Test Set (3 points)

Using the tuned classifier and threshold from part (k), evaluate the model on the test set. Report the following metrics:

  • F1 Score
  • Accuracy
  • Precision
  • Recall
  • AUC

B.1.15 o) Compare Tuning Strategies and Computational Cost (3 points)

Compare the tuning approach used in parts (g) & (h) (separate tuning of K and threshold) with the approach in part (k) (joint tuning of K and threshold) in terms of computational cost.

  • How many K and threshold combinations did you evaluate in each approach? (2 points)
  • Based on this comparison and your answer from part (l), explain the main trade-off involved in model tuning (e.g., between computation and performance). (2 points)

B.1.16 p) Tune K Using Multiple Metrics (5 points)

GridSearchCV and cross_val_score only allow tuning based on a single metric. In this part, you’ll practice tuning hyperparameters while evaluating multiple metrics simultaneously.

Cross-validate a KNN classifier using the following specifications:

  • Use every odd K value from 1 to 50 (inclusive), and keep all other hyperparameters at their default settings.
  • Apply the cross-validation settings from part (f).
  • Evaluate the model using accuracy, precision, and recall as metrics at the same time.

Save the cross-validation results into a DataFrame, compute the average score for each metric, and visualize how these metrics change with different values of K.
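One way to evaluate several metrics at once (on synthetic stand-in data) is cross_validate with a list of scoring names; each metric then appears as its own column in the results DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

rows = []
for k in np.arange(1, 50, 2):
    scores = cross_validate(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv,
                            scoring=["accuracy", "precision", "recall"])
    rows.append({"K": k,
                 "accuracy": scores["test_accuracy"].mean(),
                 "precision": scores["test_precision"].mean(),
                 "recall": scores["test_recall"].mean()})

cv_results = pd.DataFrame(rows)
print(cv_results.head())
# cv_results.plot(x="K")  # one way to visualize all three metrics vs. K
```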

B.1.17 q) Optimize for Precision with Recall Constraint (4 points)

Identify the K value that yields the highest precision, while maintaining a recall of at least 75%.
(3 points)

Then, print the average cross-validation metrics (accuracy, precision, recall) for that K value.
(1 point)
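With the CV metrics in a DataFrame, the constrained selection reduces to a filter followed by idxmax. The numbers below are hypothetical placeholders, not results from the loan data:

```python
import pandas as pd

# Hypothetical CV summary: one row per K with mean CV metrics
cv_summary = pd.DataFrame({
    "K":         [1, 3, 5, 7],
    "precision": [0.80, 0.84, 0.86, 0.83],
    "recall":    [0.90, 0.78, 0.72, 0.80],
})

# Keep only K values meeting the recall constraint, then maximize precision
eligible = cv_summary[cv_summary["recall"] >= 0.75]
best_row = eligible.loc[eligible["precision"].idxmax()]
print(best_row)
```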

B.1.18 r) Tune Threshold for Maximum Precision (3 points)

Using the optimal K value identified in part (q), find the threshold that maximizes cross-validation precision, following the specifications below:

  • Evaluate all possible threshold values with a step size of 0.05.
  • Use the cross-validation settings from part (f).

Then:

  • Print the best cross-validation precision.
  • Report the threshold value that achieves this precision.

Note: This task is very similar to part (h), but it’s important for the next part.

B.1.19 s) Evaluate Precision-Optimized Model on Test Set (2 points)

Using the tuned classifier and threshold from parts (q) and (r), evaluate the model on the test set. Report the following metrics:

  • Test Accuracy
  • Test Precision
  • Test Recall
  • Test AUC

B.1.20 t) Final Reflection: Comparing Tuning Strategies (3 points)

You have now tuned your KNN classifier using three different strategies:

  1. Sequential tuning of K and threshold based on F1 score (parts g–h)
  2. Joint tuning of K and threshold using F1 score (part k)
  3. Tuning based on multiple metrics, selecting the K with the highest precision while maintaining recall ≥ 75% (parts p–r)

Reflect on the following:

  • Which tuning strategy led to the best overall performance on the test set, based on the metrics you care about most?
  • Which strategy would you choose in a real-world application, and why?
  • What are the trade-offs between tuning for F1 score versus prioritizing precision or recall individually?

Note: This is an open-ended question. As long as your reasoning makes sense, you will receive full credit.

B.2 Tuning a KNN Regressor on Bank Loan Data (32 points)

In this question, you will use bank_loan_train_data.csv to tune the model hyperparameters and train the model. Each row is a loan, and each column represents some financial information as follows:

  • money_made_inv: Indicates the amount of money made by the bank on the loan. This is the regression response.

  • out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors

  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

  • int_rate: Interest Rate on the loan

  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.

  • mort_acc: The number of mortgage accounts

  • application_type_Individual: 1 if the loan is an individual application, 0 if it is a joint application with two co-borrowers

  • tot_cur_bal: Total current balance of all accounts

  • pub_rec: Number of derogatory public records

As indicated above, money_made_inv is the response and all the remaining columns are predictors. You will tune and train a K-Nearest Neighbors (KNN) regressor throughout this question.

B.2.1 a) Split, Scale, and Tune a KNN Regressor (15 points)

Create the training and test datasets using the following specifications:

  • Use an 80%-20% split.
  • Set random_state=1 for reproducibility.

Then, scale your data, as KNN is sensitive to the scale of input features.

Next, you will tune a KNN Regressor by searching for the optimal hyperparameters using three search approaches: Grid Search, Random Search, and Bayesian Search.

B.2.1.1 Cross-Validation Setting

You should use 5-fold cross-validation, with the following specifications:

  • The data should be shuffled before splitting
  • Use random_state=1 to ensure reproducibility

B.2.1.2 Hyperparameters to Tune:

You will tune the following hyperparameters for the K-Nearest Neighbors (KNN) Regressor, using Minkowski as the distance metric:

  1. n_neighbors: Number of nearest neighbors

    • Tune over the range: np.arange(1, 25, 1)
  2. p: Power parameter for the Minkowski distance

    • Use values: np.arange(1, 4, 1)
    • p = 1 corresponds to Manhattan distance
    • p = 2 corresponds to Euclidean distance
    • Note: Set the distance metric to "minkowski"
  3. weights: Weight function used in prediction
    You must consider the following 5 types of weights:

    • 'uniform': All neighbors are weighted equally
    • 'distance': Weight is inversely proportional to distance
    • Custom weight functions:
      • weight ∝ 1 / distance²
      • weight ∝ 1 / distance³
      • weight ∝ 1 / distance⁴

For each search method (Grid Search, Random Search, Bayesian Search), report the following:

  • best_params_: The best combination of hyperparameters
  • best_score_: Cross-validated RMSE on the training set
  • Test RMSE obtained from the best model
  • Execution time for the search process

Hint:

Define three custom weight functions as shown below:

def dist_power_2(distance):
    return 1 / (1e-10 + distance**2)

def dist_power_3(distance):
    return 1 / (1e-10 + distance**3)

def dist_power_4(distance):
    return 1 / (1e-10 + distance**4)

Note the small constant 1e-10 helps avoid division by zero and numerical instability.
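With those weight functions defined, a grid search sketch could look like the following, on synthetic stand-in data (only dist_power_2 appears in the grid to keep the sketch short; you would add dist_power_3 and dist_power_4 as well):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

def dist_power_2(distance):
    # Inverse-square weighting; 1e-10 guards against division by zero
    return 1 / (1e-10 + distance**2)

# Synthetic stand-in for the (scaled) bank-loan training data
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

param_grid = {
    "n_neighbors": np.arange(1, 25, 1),
    "p": np.arange(1, 4, 1),
    "weights": ["uniform", "distance", dist_power_2],  # add dist_power_3/4 too
}
grid = GridSearchCV(KNeighborsRegressor(metric="minkowski"), param_grid,
                    cv=cv, scoring="neg_root_mean_squared_error")
grid.fit(X, y)
print(grid.best_params_)
print("CV RMSE:", -grid.best_score_)  # scores are negated RMSE
```

RandomizedSearchCV has the same interface plus n_iter and random_state; for Bayesian search, BayesSearchCV from the scikit-optimize package is one common choice (a library assumption, not a course requirement). Record execution time by wrapping each .fit(...) call with calls to time.time().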

B.2.2 b) Compare Tuning Approaches (1 point)

Compare the results from part (2a) in terms of execution time and model performance.
Briefly discuss the main trade-offs among the three hyperparameter tuning approaches: Grid Search, Random Search, and Bayesian Search.

B.2.3 c) Feature Selection and Hyperparameter Tuning with GridSearchCV (15 points)

KNN performance can deteriorate significantly if irrelevant or noisy predictors are included. In this part, you will explore feature selection to improve model performance, followed by hyperparameter tuning using GridSearchCV (with refit=True).

Try the following feature selection approaches:

  1. Correlation-based filtering:
    • Select features with an absolute correlation of at least 0.1 with the target variable.
  2. Lasso regression for feature selection:
    • Use Lasso(alpha=50) to select important features based on non-zero coefficients.
  3. SelectKBest:
    • Use SelectKBest with f_regression, selecting the top 4 features.

For each approach, perform hyperparameter tuning using GridSearchCV, and report:

  • The best score (cross-validated RMSE) on the training set
  • The test RMSE from the best model
  • The best hyperparameters
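As one illustration (the SelectKBest route, on synthetic stand-in data), the scaler, selector, and KNN can be chained in a Pipeline so that GridSearchCV with refit=True tunes and refits everything together without leaking test information:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank-loan training data
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=1)

# Pipeline: scale -> keep top-4 features -> KNN
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_regression, k=4)),
    ("knn", KNeighborsRegressor()),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": np.arange(1, 25, 1)},
                    cv=KFold(n_splits=5, shuffle=True, random_state=1),
                    scoring="neg_root_mean_squared_error", refit=True)
grid.fit(X, y)
print(grid.best_params_)
print("CV RMSE:", -grid.best_score_)
```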

B.2.4 d) Compare Feature Selection Approaches (1 point)

Create a DataFrame that summarizes the model performance from each feature selection method, including:

  • Training RMSE
  • Test RMSE

Be sure to also include the results from the model trained without any feature selection for comparison.

Then, briefly explain what you learned from this experiment.
For example: Did feature selection improve performance? Which method worked best?